Fishermen like to catch record-breaking fish, and if successful will put some effort into documenting their historic achievement in the local pub ;-) This dataset contains sizes of such record catches of Hucho in Austrian streams and rivers of various size; the data were collected from various sources, including the odd black-and-white photograph hanging in a pub.
Here is one such picture from the author of the study:
The Danube salmon is an endangered species with dwindling populations. It needs intact river corridors for migration. Hydropower facilities should have a fish ladder that allows the fish to bypass the turbines. The size of such a fish ladder is both a cost issue AND an ecological issue - it needs to be large enough to accommodate the expected fish size in any given system. Usually, large rivers host large fish, so river size could be taken as a proxy for the fish size to be expected.
hucho = read.table("data/HuchoRatschan2012.txt", header = TRUE)
names(hucho)
## [1] "river" "population" "length" "mass" "discharge"
## [6] "width"
plot(length~width,data=hucho)
plot(mass~log(discharge),data=hucho) # I go for this one, but several others may also make sense
humo1<-lm(mass~log(discharge),data=hucho)
summary(humo1)
##
## Call:
## lm(formula = mass ~ log(discharge), data = hucho)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0650 -3.7538 0.5911 2.5827 9.0719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.8898 1.9967 7.457 2.6e-08 ***
## log(discharge) 2.4026 0.5433 4.422 0.000118 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.908 on 30 degrees of freedom
## Multiple R-squared: 0.3946, Adjusted R-squared: 0.3744
## F-statistic: 19.56 on 1 and 30 DF, p-value: 0.0001184
abline(humo1)
coef(humo1)
## (Intercept) log(discharge)
## 14.88982 2.40255
Hucho model1: fish mass = 14.9 + 2.4 * log(Q)
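The fitted equation can be applied directly to get an expected record size for a river of given discharge. A minimal sketch using the coefficients reported above (units assumed as in the dataset, e.g. mass in kg and discharge in m^3/s):

```r
# Sketch: evaluate model1 by hand with the coefficients from the summary above
b <- c(intercept = 14.8898, slope = 2.4026)
predict_mass <- function(Q) unname(b["intercept"] + b["slope"] * log(Q))
predict_mass(100)  # ~26: expected record mass for a river with discharge 100
```

The same number comes from predict(humo1, newdata = data.frame(discharge = 100)).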
boxplot(mass~population,data=hucho)
# this is an ANOVA-type problem
hist(hucho$mass[hucho$population=="limited"]) # normal distribution (ND) will anyway be hard to judge with so few values per group...
qqnorm(hucho$mass[hucho$population=="limited"])
qqnorm(hucho$mass[hucho$population=="medium"])
qqnorm(hucho$mass[hucho$population=="large"])
hist(unlist(tapply(hucho$mass,INDEX=hucho$population,FUN=scale)))
qqnorm(unlist(tapply(hucho$mass,INDEX=hucho$population,FUN=scale)))
shapiro.test(unlist(tapply(hucho$mass,INDEX=hucho$population,FUN=scale))) # check ND
##
## Shapiro-Wilk normality test
##
## data: unlist(tapply(hucho$mass, INDEX = hucho$population, FUN = scale))
## W = 0.95183, p-value = 0.1624
bartlett.test(mass~population,data=hucho) # check var.hom.
##
## Bartlett test of homogeneity of variances
##
## data: mass by population
## Bartlett's K-squared = 3.0522, df = 2, p-value = 0.2174
humo2<-lm(mass~population,data=hucho)
summary(humo2)
##
## Call:
## lm(formula = mass ~ population, data = hucho)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0692 -4.1808 -0.6769 3.5511 12.0308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.785 1.638 15.130 2.69e-15 ***
## populationlimited -6.643 2.915 -2.279 0.0302 *
## populationmedium -1.715 2.317 -0.740 0.4650
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.906 on 29 degrees of freedom
## Multiple R-squared: 0.1527, Adjusted R-squared: 0.09425
## F-statistic: 2.613 on 2 and 29 DF, p-value: 0.0905
anova(humo2)
## Analysis of Variance Table
##
## Response: mass
## Df Sum Sq Mean Sq F value Pr(>F)
## population 2 182.29 91.144 2.6129 0.0905 .
## Residuals 29 1011.61 34.883
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA: population size does not significantly affect fish size
pairwise.t.test(hucho$mass,hucho$population,p.adjust.method="holm",pool.sd=TRUE) # note: the argument is p.adjust.method, not method
##
## Pairwise comparisons using t tests with pooled SD
##
## data: hucho$mass and hucho$population
##
## large limited
## limited 0.091 -
## medium 0.465 0.203
##
## P value adjustment method: holm
# pairwise no significant difference!
Population size does not significantly influence maximum fish size (F=2.6, df1=2, df2=29, P=0.09), but there is a tendency for the limited populations to have smaller fish on average.
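As a cross-check of the pairwise t-tests, Tukey's HSD on an aov() fit controls the family-wise error rate directly; with the hucho data the call would be TukeyHSD(aov(mass ~ population, data = hucho)). A self-contained sketch on simulated three-group data:

```r
# Sketch: Tukey HSD on simulated data standing in for the hucho groups
set.seed(1)
d <- data.frame(
  mass = c(rnorm(10, 25), rnorm(10, 18), rnorm(10, 23)),
  population = rep(c("large", "limited", "medium"), each = 10)
)
fit <- aov(mass ~ population, data = d)
TukeyHSD(fit)  # adjusted p-values and confidence intervals for all pairs
```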
hucho$population<-factor(hucho$population)
levels(hucho$population) # check order of levels
## [1] "large" "limited" "medium"
plot(mass~log(discharge),data=hucho,col=hucho$population)
abline(humo1)
boxplot(humo1$residuals~hucho$population) # this residual analysis strongly hints at an effect of population!
humo3<-lm(mass~log(discharge)*population,data=hucho)
summary(humo3)
##
## Call:
## lm(formula = mass ~ log(discharge) * population, data = hucho)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8704 -3.2566 -0.0332 1.8738 7.7866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.1644 3.5591 4.261 0.000236 ***
## log(discharge) 2.7973 0.9694 2.886 0.007753 **
## populationlimited 0.2723 5.7999 0.047 0.962911
## populationmedium -0.2760 4.3105 -0.064 0.949445
## log(discharge):populationlimited -1.9670 1.6120 -1.220 0.233328
## log(discharge):populationmedium -0.2446 1.1679 -0.209 0.835716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.494 on 26 degrees of freedom
## Multiple R-squared: 0.5602, Adjusted R-squared: 0.4757
## F-statistic: 6.624 on 5 and 26 DF, p-value: 0.0004239
anova(humo3)
## Analysis of Variance Table
##
## Response: mass
## Df Sum Sq Mean Sq F value Pr(>F)
## log(discharge) 1 471.13 471.13 23.3305 5.261e-05 ***
## population 2 163.37 81.68 4.0450 0.02955 *
## log(discharge):population 2 34.36 17.18 0.8508 0.43863
## Residuals 26 525.04 20.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
humo4<-lm(mass~log(discharge)+population,data=hucho)
anova(humo4)
## Analysis of Variance Table
##
## Response: mass
## Df Sum Sq Mean Sq F value Pr(>F)
## log(discharge) 1 471.13 471.13 23.5818 4.11e-05 ***
## population 2 163.37 81.68 4.0886 0.02768 *
## Residuals 28 559.40 19.98
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(humo4)
##
## Call:
## lm(formula = mass ~ log(discharge) + population, data = hucho)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8325 -2.7893 -0.0949 1.9198 8.1081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.6708 2.1084 7.907 1.30e-08 ***
## log(discharge) 2.3593 0.4959 4.758 5.38e-05 ***
## populationlimited -6.2152 2.2079 -2.815 0.00883 **
## populationmedium -1.1626 1.7570 -0.662 0.51357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.47 on 28 degrees of freedom
## Multiple R-squared: 0.5315, Adjusted R-squared: 0.4812
## F-statistic: 10.59 on 3 and 28 DF, p-value: 7.988e-05
plot(mass~log(discharge),data=hucho,col=c("green","red","black")[hucho$population]) # index colours by factor level so point colours match the lines below
abline(a=coef(humo4)[1],b=coef(humo4)[2],col="green")
abline(a=coef(humo4)[1]+coef(humo4)[3],b=coef(humo4)[2],col="red")
abline(a=coef(humo4)[1]+coef(humo4)[4],b=coef(humo4)[2],col="black")
The best model to estimate fish ladder dimensions comes from mass ~ log(Q) + population, using the intercept fitted for large populations:
Hucho model4: fish mass = 16.67 + 2.36 * log(Q)
which has a similar slope but a higher intercept than the model that ignores population status:
Hucho model1: fish mass = 14.9 + 2.4 * log(Q)
With model1, fish sizes would be underestimated, and fish passes would not accommodate the largest fish we could hope for in eventually healthy populations.
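The size of the underestimation can be read directly off the two equations; a short sketch using the coefficients reported above (hypothetical discharge values, units as in the dataset):

```r
# Sketch: difference between model4 (large populations) and model1 predictions
m1 <- function(Q) 14.8898 + 2.4026 * log(Q)  # ignores population status
m4 <- function(Q) 16.6708 + 2.3593 * log(Q)  # intercept for large populations
Q <- c(10, 50, 100)
round(m4(Q) - m1(Q), 2)  # model1 predicts roughly 1.6 units of mass less
```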
Load the LakeSPM dataset (Lindström, M., Håkanson, L., Abrahamsson, O., Johansson, H. (1999) An empirical model for prediction of lake water suspended particulate matter. Ecological Modelling 121, 185–198).
lakes <- read.table("data/LakeSPM.txt",header=TRUE)
The study has a set of lakes with SPM = suspended particulate matter as the main response of interest. High SPM is a water quality issue; it can be generated through excessive phytoplankton growth, resuspension of sediment in shallow lakes, or terrigenous input into smaller lakes. For water quality prediction, a model to predict SPM from easily accessible lake data should be generated. The available predictors (check the Excel sheet in the data folder for the exact meaning of the abbreviations) include
# choose variables and run MLR
plot(lakes[,c('spm','TP','Dm')])
plot(log(lakes[,c('spm','TP','Dm')]))
Explore the data (pairs() or plot(data)). Consider appropriate transformations to improve the linearity of relationships.
plot(lakes[,-c(1:2)])
boxplot(scale(lakes[,-c(1:2)]))
#pH is already a log-transform, Vd seems symmetrically distributed, all others are skewed and are log-transformed
names(lakes)
## [1] "lakename" "lakeno" "spm" "area" "Dm" "Dmax"
## [7] "pH" "TP" "T" "Q" "DR" "Vd"
lakes[,-c(1,2,7,12)]<-log(lakes[,-c(1,2,7,12)])
plot(lakes[,-c(1:2)])
boxplot(scale(lakes[,-c(1:2)]))
# distributions have improved, some linear relationships obvious - with response as well as among predictors (collinearity)
With vif() from the car package you can assess collinearity. Identify redundant variables using the VIF and drop them from the dataset.
lakes<-lakes[,-c(1:2)]
mlr.full<-lm(spm~.,data=lakes) # makes a model with all predictors available
library(car)
## Loading required package: carData
vif(mlr.full) #computes VIF
## area Dm Dmax pH TP T
## 1.259843e+06 9.310628e+05 3.947461e+02 2.337318e+00 3.883131e+00 3.576987e+02
## Q DR Vd
## 5.275149e+02 1.034419e+06 4.269601e+01
# check what VIF means:
summary(lm(area~.-spm,data=lakes)) # regressing area on all other predictors gives an R2 of (almost) 1!
##
## Call:
## lm(formula = area ~ . - spm, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0053403 -0.0008962 0.0003460 0.0008821 0.0040917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.894e-04 5.712e-02 -0.010 0.992
## Dm 2.005e+00 1.526e-02 131.403 <2e-16 ***
## Dmax -2.862e-03 9.258e-03 -0.309 0.761
## pH 1.243e-04 1.002e-03 0.124 0.903
## TP 7.409e-05 1.685e-03 0.044 0.965
## T -2.094e-04 3.995e-03 -0.052 0.959
## Q -1.008e-04 3.802e-03 -0.027 0.979
## DR 1.999e+00 7.636e-03 261.836 <2e-16 ***
## Vd 9.416e-04 8.395e-03 0.112 0.912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002301 on 17 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.677e+06 on 8 and 17 DF, p-value: < 2.2e-16
# some really high VIF point to collinearity, start removing predictors and repeat procedure.
# full model without area
mlr.full<-lm(spm~.-area,data=lakes)
vif(mlr.full)
## Dm Dmax pH TP T Q DR
## 915.771551 392.539697 2.335205 3.882689 357.640871 527.493110 256.436037
## Vd
## 42.664441
# full model without area and Dm
mlr.full<-lm(spm~.-area-Dm,data=lakes)
vif(mlr.full)
## Dmax pH TP T Q DR Vd
## 224.429091 2.326471 3.707657 130.252655 191.099549 88.934283 27.887176
# -> still high VIFs for some predictors, drop Dmax
# full model without Dm and area and Dmax
mlr.full<-lm(spm~.-area-Dm-Dmax,data=lakes)
vif(mlr.full)
## pH TP T Q DR Vd
## 2.289150 3.480125 2.393403 2.406980 2.435265 2.249989
# -> now all predictors have VIF<10 and make sense as input for MLR
# note that DR and Vd (which remain included) are lake morphometric measures computed from area, Dm and Dmax!
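The drop-the-highest-VIF procedure above can be automated. A minimal sketch on simulated collinear data, computing the VIF by hand as 1/(1 - R2_j) so it only needs base R (with the real data you would start from lakes and use car::vif):

```r
# Sketch: iteratively drop the predictor with the largest VIF until all < 10
set.seed(42)
n <- 50
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.01); x3 <- rnorm(n)  # x2 ~ x1
X <- data.frame(x1, x2, x3)
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
    1 / (1 - r2)  # VIF_j = 1 / (1 - R2 of predictor j on all others)
  })
}
while (max(vif_manual(X)) > 10) {
  worst <- names(which.max(vif_manual(X)))
  X <- X[, names(X) != worst, drop = FALSE]
}
names(X)  # predictors kept after pruning (one of the x1/x2 pair is gone)
```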
# delete the unnecessary predictors from the dataframe
which(names(lakes) %in% c("area","Dm","Dmax"))
## [1] 2 3 4
lakes<-lakes[,-c(2,3,4)]
mlr.full<-lm(spm~.,data=lakes) # makes a model with all predictors available
summary(mlr.full)
##
## Call:
## lm(formula = spm ~ ., data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92577 -0.11178 0.02268 0.20035 0.76549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.60757 1.29121 -2.794 0.01158 *
## pH 0.25790 0.16681 1.546 0.13860
## TP 1.02714 0.26827 3.829 0.00113 **
## T -0.04065 0.05496 -0.740 0.46863
## Q -0.02767 0.04319 -0.641 0.52947
## DR 0.36796 0.12515 2.940 0.00840 **
## Vd 0.31319 0.32422 0.966 0.34619
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3871 on 19 degrees of freedom
## Multiple R-squared: 0.8797, Adjusted R-squared: 0.8417
## F-statistic: 23.15 on 6 and 19 DF, p-value: 8.788e-08
# -> this is "maximum" model with lots of probably unimportant predictors, maximum "achievable" r2 is now 0.88
plot(lakes)
simple.mod1<-lm(spm~TP,data=lakes)
summary(simple.mod1)
##
## Call:
## lm(formula = spm ~ TP, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90449 -0.28309 -0.08777 0.17917 1.20813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8724 0.5793 -6.685 6.48e-07 ***
## TP 1.5517 0.1891 8.206 2.00e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5089 on 24 degrees of freedom
## Multiple R-squared: 0.7373, Adjusted R-squared: 0.7263
## F-statistic: 67.35 on 1 and 24 DF, p-value: 2.004e-08
plot(spm~TP,data=lakes)
abline(simple.mod1,col="red")
# -> very good relationship, high r2
With add1() and drop1(), terms can be added to or removed from models. Follow a forward and a backward model-building process until a final result and compare the two final models.
add1(simple.mod1,scope=~pH+TP+T+Q+DR+Vd,test="F")
## Single term additions
##
## Model:
## spm ~ TP
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 6.2163 -33.204
## pH 1 2.02398 4.1923 -41.446 11.1041 0.002896 **
## T 1 0.41143 5.8048 -32.985 1.6302 0.214421
## Q 1 0.00925 6.2070 -31.243 0.0343 0.854777
## DR 1 1.96782 4.2484 -41.100 10.6533 0.003414 **
## Vd 1 0.79150 5.4248 -34.745 3.3558 0.079950 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> pH and DR seem to be good candidates for adding to the model
# -> add pH (its inclusion alongside TP causes strongest change in model according to F-value and produces lowest AIC)
mlr1<-lm(spm~TP+pH,data=lakes)
summary(mlr1)
##
## Call:
## lm(formula = spm ~ TP + pH, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.86284 -0.24187 -0.04459 0.18318 0.91888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.5089 0.9285 -7.010 3.83e-07 ***
## TP 1.4664 0.1607 9.127 4.16e-09 ***
## pH 0.4105 0.1232 3.332 0.0029 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4269 on 23 degrees of freedom
## Multiple R-squared: 0.8228, Adjusted R-squared: 0.8074
## F-statistic: 53.4 on 2 and 23 DF, p-value: 2.275e-09
# -> good model, try adding more terms
add1(mlr1,scope=~pH+TP+T+Q+DR+Vd,test="F")
## Single term additions
##
## Model:
## spm ~ TP + pH
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 4.1923 -41.446
## T 1 0.00252 4.1898 -39.462 0.0132 0.90954
## Q 1 0.02363 4.1686 -39.593 0.1247 0.72735
## DR 1 0.92796 3.2643 -45.951 6.2540 0.02033 *
## Vd 1 0.03692 4.1554 -39.676 0.1955 0.66270
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> add DR
mlr2<-lm(spm~TP+pH+DR,data=lakes)
summary(mlr2)
##
## Call:
## lm(formula = spm ~ TP + pH + DR, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94258 -0.17068 0.02951 0.18948 0.85895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.4940 1.1623 -3.866 0.000835 ***
## TP 1.1434 0.1942 5.889 6.32e-06 ***
## pH 0.3058 0.1188 2.575 0.017261 *
## DR 0.2871 0.1148 2.501 0.020332 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3852 on 22 degrees of freedom
## Multiple R-squared: 0.862, Adjusted R-squared: 0.8432
## F-statistic: 45.82 on 3 and 22 DF, p-value: 1.247e-09
# -> good model, all terms significant try adding more terms
add1(mlr2,scope=~pH+TP+T+Q+DR+Vd,test="F")
## Single term additions
##
## Model:
## spm ~ TP + pH + DR
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3.2643 -45.951
## T 1 0.009151 3.2552 -44.024 0.0590 0.8104
## Q 1 0.225708 3.0386 -45.814 1.5599 0.2254
## Vd 1 0.291287 2.9730 -46.381 2.0575 0.1662
# -> no more terms to add according to F -> mlr2 is the final model (as in the original publication)
# -> note that according to AIC we could add Vd and keep building a more complicated model
# alternatively with dropping terms from the full model
summary(mlr.full)
##
## Call:
## lm(formula = spm ~ ., data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92577 -0.11178 0.02268 0.20035 0.76549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.60757 1.29121 -2.794 0.01158 *
## pH 0.25790 0.16681 1.546 0.13860
## TP 1.02714 0.26827 3.829 0.00113 **
## T -0.04065 0.05496 -0.740 0.46863
## Q -0.02767 0.04319 -0.641 0.52947
## DR 0.36796 0.12515 2.940 0.00840 **
## Vd 0.31319 0.32422 0.966 0.34619
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3871 on 19 degrees of freedom
## Multiple R-squared: 0.8797, Adjusted R-squared: 0.8417
## F-statistic: 23.15 on 6 and 19 DF, p-value: 8.788e-08
# -> unreasonable model with many insignificant terms
drop1(mlr.full,test="F")
## Single term deletions
##
## Model:
## spm ~ pH + TP + T + Q + DR + Vd
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2.8466 -43.511
## pH 1 0.35810 3.2047 -42.430 2.3901 0.138597
## TP 1 2.19631 5.0429 -30.643 14.6594 0.001133 **
## T 1 0.08194 2.9286 -44.773 0.5469 0.468633
## Q 1 0.06147 2.9081 -44.955 0.4103 0.529466
## DR 1 1.29509 4.1417 -35.762 8.6441 0.008404 **
## Vd 1 0.13980 2.9864 -44.264 0.9331 0.346193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> drop Q (lowest F, highest P, also makes lowest AIC)
mlr3<-lm(spm~pH+TP+T+DR+Vd,data=lakes)
summary(mlr3)
##
## Call:
## lm(formula = spm ~ pH + TP + T + DR + Vd, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88293 -0.20247 0.00856 0.23363 0.81997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.80213 1.23633 -3.075 0.00597 **
## pH 0.21254 0.14880 1.428 0.16861
## TP 0.98497 0.25620 3.844 0.00101 **
## T -0.03584 0.05364 -0.668 0.51164
## DR 0.35463 0.12158 2.917 0.00853 **
## Vd 0.42131 0.27270 1.545 0.13804
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3813 on 20 degrees of freedom
## Multiple R-squared: 0.8771, Adjusted R-squared: 0.8464
## F-statistic: 28.54 on 5 and 20 DF, p-value: 1.869e-08
drop1(mlr3,test="F")
## Single term deletions
##
## Model:
## spm ~ pH + TP + T + DR + Vd
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2.9081 -44.955
## pH 1 0.29667 3.2048 -44.430 2.0403 0.168611
## TP 1 2.14909 5.0572 -32.569 14.7800 0.001012 **
## T 1 0.06492 2.9730 -46.381 0.4465 0.511638
## DR 1 1.23717 4.1453 -37.739 8.5084 0.008526 **
## Vd 1 0.34706 3.2552 -44.024 2.3868 0.138038
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> drop T
mlr4<-lm(spm~pH+TP+DR+Vd,data=lakes)
summary(mlr4)
##
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82209 -0.18123 -0.00184 0.21378 0.85912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8703 1.2158 -3.183 0.00447 **
## pH 0.1878 0.1422 1.321 0.20081
## TP 1.0959 0.1925 5.693 1.19e-05 ***
## DR 0.3433 0.1188 2.890 0.00876 **
## Vd 0.3709 0.2586 1.434 0.16618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3763 on 21 degrees of freedom
## Multiple R-squared: 0.8743, Adjusted R-squared: 0.8504
## F-statistic: 36.53 on 4 and 21 DF, p-value: 3.542e-09
drop1(mlr4,test="F")
## Single term deletions
##
## Model:
## spm ~ pH + TP + DR + Vd
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2.9730 -46.381
## pH 1 0.2469 3.2200 -46.307 1.7443 0.200810
## TP 1 4.5877 7.5608 -24.113 32.4054 1.192e-05 ***
## DR 1 1.1823 4.1554 -39.676 8.3513 0.008764 **
## Vd 1 0.2913 3.2643 -45.951 2.0575 0.166183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# according to F we should drop pH, according to AIC this is already the best model
# rule of thumb: AIC differences <2 are not reliable, thus drop pH
mlr5<-lm(spm~TP+DR+Vd,data=lakes)
summary(mlr5)
##
## Call:
## lm(formula = spm ~ TP + DR + Vd, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78084 -0.20451 -0.01827 0.19437 0.88301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.5364 0.6881 -3.686 0.001293 **
## TP 1.0455 0.1919 5.449 1.79e-05 ***
## DR 0.4158 0.1071 3.881 0.000805 ***
## Vd 0.5685 0.2145 2.651 0.014600 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3826 on 22 degrees of freedom
## Multiple R-squared: 0.8639, Adjusted R-squared: 0.8453
## F-statistic: 46.55 on 3 and 22 DF, p-value: 1.074e-09
drop1(mlr5,test="F")
## Single term deletions
##
## Model:
## spm ~ TP + DR + Vd
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 3.2200 -46.307
## TP 1 4.3462 7.5662 -26.095 29.6947 1.788e-05 ***
## DR 1 2.2048 5.4248 -34.745 15.0638 0.0008054 ***
## Vd 1 1.0285 4.2484 -41.100 7.0268 0.0146000 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> no more terms to drop
# -> final model is different between forward and backward variable selection, even though 2 of 3 predictors are identical!
Use step() to identify a good model based on the AIC. How does this compare to the models created above? Check model assumptions and decide on a final model.
step(mlr.full,direction="backward")
## Start: AIC=-43.51
## spm ~ pH + TP + T + Q + DR + Vd
##
## Df Sum of Sq RSS AIC
## - Q 1 0.06147 2.9081 -44.955
## - T 1 0.08194 2.9286 -44.773
## - Vd 1 0.13980 2.9864 -44.264
## <none> 2.8466 -43.511
## - pH 1 0.35810 3.2047 -42.430
## - DR 1 1.29509 4.1417 -35.762
## - TP 1 2.19631 5.0429 -30.643
##
## Step: AIC=-44.96
## spm ~ pH + TP + T + DR + Vd
##
## Df Sum of Sq RSS AIC
## - T 1 0.06492 2.9730 -46.381
## <none> 2.9081 -44.955
## - pH 1 0.29667 3.2048 -44.430
## - Vd 1 0.34706 3.2552 -44.024
## - DR 1 1.23717 4.1453 -37.739
## - TP 1 2.14909 5.0572 -32.569
##
## Step: AIC=-46.38
## spm ~ pH + TP + DR + Vd
##
## Df Sum of Sq RSS AIC
## <none> 2.9730 -46.381
## - pH 1 0.2469 3.2200 -46.307
## - Vd 1 0.2913 3.2643 -45.951
## - DR 1 1.1823 4.1554 -39.676
## - TP 1 4.5877 7.5608 -24.113
##
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = lakes)
##
## Coefficients:
## (Intercept) pH TP DR Vd
## -3.8703 0.1878 1.0959 0.3433 0.3709
step(simple.mod1,scope=~pH+TP+T+Q+DR+Vd,direction="forward")
## Start: AIC=-33.2
## spm ~ TP
##
## Df Sum of Sq RSS AIC
## + pH 1 2.02398 4.1923 -41.446
## + DR 1 1.96782 4.2484 -41.100
## + Vd 1 0.79150 5.4248 -34.745
## <none> 6.2163 -33.204
## + T 1 0.41143 5.8048 -32.985
## + Q 1 0.00925 6.2070 -31.243
##
## Step: AIC=-41.45
## spm ~ TP + pH
##
## Df Sum of Sq RSS AIC
## + DR 1 0.92796 3.2643 -45.951
## <none> 4.1923 -41.446
## + Vd 1 0.03692 4.1554 -39.676
## + Q 1 0.02363 4.1686 -39.593
## + T 1 0.00252 4.1898 -39.462
##
## Step: AIC=-45.95
## spm ~ TP + pH + DR
##
## Df Sum of Sq RSS AIC
## + Vd 1 0.291287 2.9730 -46.381
## <none> 3.2643 -45.951
## + Q 1 0.225708 3.0386 -45.814
## + T 1 0.009151 3.2552 -44.024
##
## Step: AIC=-46.38
## spm ~ TP + pH + DR + Vd
##
## Df Sum of Sq RSS AIC
## <none> 2.9730 -46.381
## + T 1 0.064924 2.9081 -44.955
## + Q 1 0.044461 2.9286 -44.773
##
## Call:
## lm(formula = spm ~ TP + pH + DR + Vd, data = lakes)
##
## Coefficients:
## (Intercept) TP pH DR Vd
## -3.8703 1.0959 0.1878 0.3433 0.3709
# same solutions! compare with selection by F!
# combined forward and backward selection based on AIC
step(mlr.full,direction="both")
## Start: AIC=-43.51
## spm ~ pH + TP + T + Q + DR + Vd
##
## Df Sum of Sq RSS AIC
## - Q 1 0.06147 2.9081 -44.955
## - T 1 0.08194 2.9286 -44.773
## - Vd 1 0.13980 2.9864 -44.264
## <none> 2.8466 -43.511
## - pH 1 0.35810 3.2047 -42.430
## - DR 1 1.29509 4.1417 -35.762
## - TP 1 2.19631 5.0429 -30.643
##
## Step: AIC=-44.96
## spm ~ pH + TP + T + DR + Vd
##
## Df Sum of Sq RSS AIC
## - T 1 0.06492 2.9730 -46.381
## <none> 2.9081 -44.955
## - pH 1 0.29667 3.2048 -44.430
## - Vd 1 0.34706 3.2552 -44.024
## + Q 1 0.06147 2.8466 -43.511
## - DR 1 1.23717 4.1453 -37.739
## - TP 1 2.14909 5.0572 -32.569
##
## Step: AIC=-46.38
## spm ~ pH + TP + DR + Vd
##
## Df Sum of Sq RSS AIC
## <none> 2.9730 -46.381
## - pH 1 0.2469 3.2200 -46.307
## - Vd 1 0.2913 3.2643 -45.951
## + T 1 0.0649 2.9081 -44.955
## + Q 1 0.0445 2.9286 -44.773
## - DR 1 1.1823 4.1554 -39.676
## - TP 1 4.5877 7.5608 -24.113
##
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = lakes)
##
## Coefficients:
## (Intercept) pH TP DR Vd
## -3.8703 0.1878 1.0959 0.3433 0.3709
mlr8<-lm(spm~pH+TP+DR+Vd,data=lakes)
# assess the final model candidates: mlr2, mlr5 and mlr8
summary(mlr2)
##
## Call:
## lm(formula = spm ~ TP + pH + DR, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94258 -0.17068 0.02951 0.18948 0.85895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.4940 1.1623 -3.866 0.000835 ***
## TP 1.1434 0.1942 5.889 6.32e-06 ***
## pH 0.3058 0.1188 2.575 0.017261 *
## DR 0.2871 0.1148 2.501 0.020332 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3852 on 22 degrees of freedom
## Multiple R-squared: 0.862, Adjusted R-squared: 0.8432
## F-statistic: 45.82 on 3 and 22 DF, p-value: 1.247e-09
summary(mlr5)
##
## Call:
## lm(formula = spm ~ TP + DR + Vd, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78084 -0.20451 -0.01827 0.19437 0.88301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.5364 0.6881 -3.686 0.001293 **
## TP 1.0455 0.1919 5.449 1.79e-05 ***
## DR 0.4158 0.1071 3.881 0.000805 ***
## Vd 0.5685 0.2145 2.651 0.014600 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3826 on 22 degrees of freedom
## Multiple R-squared: 0.8639, Adjusted R-squared: 0.8453
## F-statistic: 46.55 on 3 and 22 DF, p-value: 1.074e-09
summary(mlr8)
##
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82209 -0.18123 -0.00184 0.21378 0.85912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8703 1.2158 -3.183 0.00447 **
## pH 0.1878 0.1422 1.321 0.20081
## TP 1.0959 0.1925 5.693 1.19e-05 ***
## DR 0.3433 0.1188 2.890 0.00876 **
## Vd 0.3709 0.2586 1.434 0.16618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3763 on 21 degrees of freedom
## Multiple R-squared: 0.8743, Adjusted R-squared: 0.8504
## F-statistic: 36.53 on 4 and 21 DF, p-value: 3.542e-09
extractAIC(mlr2)
## [1] 4.00000 -45.95116
extractAIC(mlr5)
## [1] 4.00000 -46.30677
extractAIC(mlr8)
## [1] 5.00000 -46.38135
# -> r2 and AIC are not very different, not enough reason to use more complex model (4 vs. 3 variables!)
# AIC differences between models <2 indicate "equivalent" models
# -> choose mlr2 or mlr5
# will use mlr2
# leave-one-out cross-validation of mlr2
ypred_cv<-rep(NA,nrow(lakes))
for(i in 1:nrow(lakes)) {
  mod <- lm(spm ~ TP + pH + DR, data = lakes[-i, ])  # refit without lake i
  ypred_cv[i]<-predict(mod, newdata = lakes[i,])     # predict the left-out lake
}
plot(lakes$spm,ypred_cv)
abline(a=0,b=1,col="red")
cor.test(lakes$spm,ypred_cv)
##
## Pearson's product-moment correlation
##
## data: lakes$spm and ypred_cv
## t = 10.129, df = 24, p-value = 3.822e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7874937 0.9546839
## sample estimates:
## cor
## 0.9002392
Conclusion: mlr2 is able to make reasonably good predictions even for left-out cases.
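An error measure on the cross-validated predictions is also informative; with the lakes data this is simply sqrt(mean((lakes$spm - ypred_cv)^2)). A self-contained sketch of the same leave-one-out idea on simulated data:

```r
# Sketch: leave-one-out CV with RMSE on simulated data (stand-in for lakes)
set.seed(7)
n <- 30
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.8 * d$x1 + 0.3 * d$x2 + rnorm(n, sd = 0.4)
ypred <- sapply(seq_len(n), function(i) {
  predict(lm(y ~ x1 + x2, data = d[-i, ]), newdata = d[i, ])  # leave out row i
})
rmse_cv <- sqrt(mean((d$y - ypred)^2))
rmse_cv  # should be close to the simulated error sd of 0.4
```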
These data are from a study of the response of trees in North America to climate change (Talluto et al. (2017) Extinction debt and colonization credit delay range shifts of eastern North American trees. Nature Ecology & Evolution 1, 0182. https://doi.org/10.1038/s41559-017-0182).
Load the somewhat deconstructed dataset, containing climate data, data on tree abundance and mortality, and a species table:
clim = read.csv("data/tree_clim.csv")
trees = read.csv("data/trees.csv")
species = read.csv("data/tree_species.csv")
We have two potential hypotheses to explore:
1: Mortality of sugar maple (Acer saccharum) is negatively related to temperature and/or climatic variability.
2: Abundance of sugar maple has a hump-shaped (quadratic) relationship to climate (i.e., there is a climatic optimum)
Choose one of the two to work on. Hypothesis 1 implies a binomial process (k trees died out of n trees present) and will have cbind(died, n) as the response variable; hypothesis 2 implies a Poisson process and will have n as the response. Formulate a more specific alternative hypothesis.
The data will need to be reshaped a bit before use.
Use reshape2::dcast to produce a “wide” format dataset. You have 6 variables, measured in plots across multiple years:
Measures of climate location
Measures of climate variability
Multicollinearity is usually a problem with climate variables, because many variables measure very similar things. However, the data set is very large, and so variance inflation is somewhat less of a problem. You should still choose a smaller number of variables to test (max 4); justify this choice based on a hypothesis. You can use cor(clim[, -c(1,2)]) to make a correlation matrix; if two variables have a correlation > 0.6, you should only use one of them. You may (optionally) also check VIFs, but you should do so after fitting the model.
library(reshape2)
clim_wide = dcast(clim, plot_id+year_measured ~ variable)
cor(clim_wide[,-(1:2)])
## annual_mean_temp mean_diurnal_range
## annual_mean_temp 1.0000000 0.56455636
## mean_diurnal_range 0.5645564 1.00000000
## mean_temp_wettest_quarter 0.6576496 0.44545131
## pp_seasonality 0.3349464 0.28917816
## pp_warmest_quarter 0.2391688 0.08772849
## tot_annual_pp 0.5952070 0.25246700
## mean_temp_wettest_quarter pp_seasonality
## annual_mean_temp 0.65764958 0.33494644
## mean_diurnal_range 0.44545131 0.28917816
## mean_temp_wettest_quarter 1.00000000 0.66255407
## pp_seasonality 0.66255407 1.00000000
## pp_warmest_quarter 0.40558332 0.04250791
## tot_annual_pp 0.04818496 -0.31387628
## pp_warmest_quarter tot_annual_pp
## annual_mean_temp 0.23916885 0.59520697
## mean_diurnal_range 0.08772849 0.25246700
## mean_temp_wettest_quarter 0.40558332 0.04818496
## pp_seasonality 0.04250791 -0.31387628
## pp_warmest_quarter 1.00000000 0.48682271
## tot_annual_pp 0.48682271 1.00000000
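The |r| > 0.6 rule can also be checked programmatically. A sketch of a small helper that lists the offending pairs from any correlation matrix (with the climate data the call would be high_cor_pairs(cor(clim_wide[, -(1:2)]))):

```r
# Sketch: list all variable pairs whose absolute correlation exceeds a threshold
high_cor_pairs <- function(cm, threshold = 0.6) {
  cm[lower.tri(cm, diag = TRUE)] <- NA  # keep each pair only once
  idx <- which(abs(cm) > threshold, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             r = cm[idx])
}
# demo on simulated data with one strongly correlated pair
set.seed(3)
m <- matrix(rnorm(200), ncol = 4)
m[, 2] <- m[, 1] + rnorm(50, sd = 0.3)
colnames(m) <- paste0("v", 1:4)
high_cor_pairs(cor(m))  # reports the v1/v2 pair
```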
I will select 4 variables, one each of temperature and precipitation, location and variability. I start with the variability: since there are only 2 such variables, I select them both (mean_diurnal_range and pp_seasonality). For temperature location, mean_temp_wettest_quarter is strongly correlated with pp_seasonality, so I go with annual_mean_temp. For precipitation, pp_warmest_quarter is the most distinct variable remaining, and has the added benefit of a nice interpretation with regard to drought stress (low pp_warmest_quarter = high drought stress).
## note that it is less error-prone if you refer to variables by name.
vars = c("plot_id", "year_measured", "annual_mean_temp", "mean_diurnal_range", "pp_seasonality", "pp_warmest_quarter")
my_clim_vars = clim_wide[, vars]
trees = merge(trees, my_clim_vars, by.x = c("plot", "year"),
by.y= c("plot_id", "year_measured"))
trees = subset(trees, species_code == "28731-ACE-SAC") ## this is the species code for sugar maple
Fit models with the glm function, using the family = poisson or family = 'binomial' option, depending on your response. Use AIC(mod) to compare different hypotheses about which variables to include. Choose a final ‘best’ model, and report the results using appropriate tables, figures, etc. If you use a poisson model, be sure to check for overdispersion in each of your models (it is not necessary to correct for it if it is too high in your final model, but do report it).
Binomial models need to know the count of events (here, deaths, because we are modelling mortality), but they also need to know the number of “trials”; in this case, we can’t observe more deaths than there were actual trees, so that variable is n. We use this to make a combined response variable: cbind(died, n)
## biggest model, including all variables
mod_h1_1 = glm(cbind(died, n) ~ annual_mean_temp + mean_diurnal_range + pp_seasonality + pp_warmest_quarter, data = trees, family=binomial)
## only location, no variability
mod_h1_2 = glm(cbind(died, n) ~ annual_mean_temp + pp_warmest_quarter, data = trees, family=binomial)
## variability only
mod_h1_3 = glm(cbind(died, n) ~ mean_diurnal_range + pp_seasonality, data = trees, family=binomial)
## temperature only
mod_h1_4 = glm(cbind(died, n) ~ annual_mean_temp + mean_diurnal_range, data = trees, family=binomial)
## With all 2-way interactions
mod_h1_int = glm(cbind(died, n) ~ (annual_mean_temp + mean_diurnal_range + pp_seasonality + pp_warmest_quarter)^2, data = trees, family=binomial)
mods_h1 = data.frame(model = c("mod_h1_1", "mod_h1_2", "mod_h1_3", "mod_h1_4", "mod_h1_int"),
aic = c(AIC(mod_h1_1), AIC(mod_h1_2), AIC(mod_h1_3),
AIC(mod_h1_4), AIC(mod_h1_int)))
mods_h1$dAIC = mods_h1$aic - min(mods_h1$aic)
mods_h1
## model aic dAIC
## 1 mod_h1_1 15645.74 80.26063
## 2 mod_h1_2 15730.74 165.26185
## 3 mod_h1_3 15665.49 100.01677
## 4 mod_h1_4 15751.62 186.14545
## 5 mod_h1_int 15565.48 0.00000
coef(mod_h1_1)
## (Intercept) annual_mean_temp mean_diurnal_range pp_seasonality
## -2.577322327 0.011936138 0.045388491 0.011135541
## pp_warmest_quarter
## -0.001051458
new_trees = trees
new_trees[, c("annual_mean_temp", "mean_diurnal_range", "pp_seasonality", "pp_warmest_quarter")] =
scale(new_trees[, c("annual_mean_temp", "mean_diurnal_range", "pp_seasonality", "pp_warmest_quarter")])
mod_h1_1_sc = glm(cbind(died, n) ~ annual_mean_temp + mean_diurnal_range + pp_seasonality + pp_warmest_quarter, data = new_trees, family=binomial)
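The point of refitting on scaled predictors is that each coefficient then measures the effect of a one-standard-deviation change, so magnitudes become directly comparable across variables. A self-contained sketch with simulated, hypothetical variables (x_small and x_big have equal per-SD effects but very different units):

```r
# Sketch: scaling predictors makes coefficient magnitudes comparable
set.seed(7)
d = data.frame(x_small = rnorm(500, sd = 1),    # SD 1
               x_big   = rnorm(500, sd = 100))  # SD 100
d$y = rbinom(500, 1, plogis(0.5 * d$x_small + 0.005 * d$x_big))

m_raw = glm(y ~ x_small + x_big, data = d, family = binomial)
d_sc = d
d_sc[, 1:2] = scale(d_sc[, 1:2])
m_sc = glm(y ~ x_small + x_big, data = d_sc, family = binomial)

coef(m_raw)  # raw slopes differ ~100-fold, purely because of the units
coef(m_sc)   # scaled slopes are similar: the variables matter about equally
```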
Among the additive models, the first one, with all four variables, is best; the interaction model has a still lower AIC, but its coefficients are much harder to interpret, so we proceed with mod_h1_1. Looking at the coefficients, we find that more variability in climate results in increased mortality. We can make response curves by generating some artificial predictor data, setting all other variables to their means when computing the predictions.
x_plot = data.frame(annual_mean_temp = seq(min(trees$annual_mean_temp), max(trees$annual_mean_temp), length.out = 500),
mean_diurnal_range = mean(trees$mean_diurnal_range), pp_seasonality = mean(trees$pp_seasonality),
pp_warmest_quarter = mean(trees$pp_warmest_quarter))
# use the predict function to find out what we expect for mortality if x takes hypothetical values
# type = 'response' takes care of transforming away the link function, otherwise we would get predictions on the logit scale
y_plot = predict(mod_h1_1, type='response', newdata = x_plot)
par(mfrow = c(2,2), bty='n')
plot(x_plot$annual_mean_temp, y_plot, type='l', xlab='Annual Mean Temperature', ylab = "Mortality Probability", ylim=c(0,1), col='blue', lwd=2)
points(trees$annual_mean_temp, trees$died/trees$n, pch=16, cex=0.4)
## now just edit the appropriate columns to make the rest of the plots
for(i in 2:4) {
nm = colnames(x_plot)[i]
nm_pr = colnames(x_plot)[i-1]
x_plot[,nm_pr] = mean(trees[,nm_pr]) # set the previous column to the mean
# set the next to a sequence
x_plot[, nm] = seq(min(trees[,nm]), max(trees[,nm]), length.out = 500)
y_plot = predict(mod_h1_1, type='response', newdata = x_plot)
plot(x_plot[,nm], y_plot, type='l', xlab=nm, ylab = "Mortality Probability", ylim=c(0,1), col='blue', lwd=2)
points(trees[,nm], trees$died/trees$n, pch=16, cex=0.4)
}
Here we check for hump-shaped relationships between the number of trees and climate. The number of trees is a count, so we use a Poisson model. I include quadratic terms only for the climate location variables; it makes sense that variability is either good or bad, not that there is necessarily an intermediate optimum.
## biggest model, including all variables and a squared term for location
mod_h2_1 = glm(n ~ annual_mean_temp + mean_diurnal_range + pp_seasonality + pp_warmest_quarter + I(annual_mean_temp^2) + I(pp_warmest_quarter^2), data = trees, family=poisson)
## only location, no variability
mod_h2_2 = glm(n ~ annual_mean_temp + pp_warmest_quarter + I(annual_mean_temp^2) + I(pp_warmest_quarter^2), data = trees, family=poisson)
## location, linear only
mod_h2_3 = glm(n ~ annual_mean_temp + pp_warmest_quarter, data = trees, family=poisson)
## temperature only
mod_h2_4 = glm(n ~ annual_mean_temp + mean_diurnal_range + I(annual_mean_temp^2), data = trees, family=poisson)
mods_h2 = data.frame(model = c("mod_h2_1", "mod_h2_2", "mod_h2_3", "mod_h2_4"),
aic = c(AIC(mod_h2_1), AIC(mod_h2_2), AIC(mod_h2_3), AIC(mod_h2_4)))
mods_h2$dAIC = mods_h2$aic - min(mods_h2$aic)
mods_h2
## model aic dAIC
## 1 mod_h2_1 42017.68 0.00000
## 2 mod_h2_2 42066.10 48.42201
## 3 mod_h2_3 42386.89 369.20384
## 4 mod_h2_4 43720.70 1703.02185
coef(mod_h2_1)
## (Intercept) annual_mean_temp mean_diurnal_range
## 7.055290e+00 -1.087067e-01 5.738812e-02
## pp_seasonality pp_warmest_quarter I(annual_mean_temp^2)
## 1.320556e-03 -2.999770e-02 1.426522e-03
## I(pp_warmest_quarter^2)
## 4.416641e-05
The first model, with all four variables plus the quadratic terms, is by far the best. Interpreting the coefficients directly in a quadratic model is difficult, so we will need to check the response curves.
x_plot = data.frame(annual_mean_temp = seq(min(trees$annual_mean_temp), max(trees$annual_mean_temp), length.out = 500),
mean_diurnal_range = mean(trees$mean_diurnal_range), pp_seasonality = mean(trees$pp_seasonality),
pp_warmest_quarter = mean(trees$pp_warmest_quarter))
# use the predict function to find out what we expect for mortality if x takes hypothetical values
# type = 'response' takes care of transforming away the link function, otherwise we would get predictions on the log (link) scale
y_plot = predict(mod_h2_1, type='response', newdata = x_plot)
par(mfrow = c(2,2), bty='n')
plot(x_plot$annual_mean_temp, y_plot, type='l', xlab='Annual Mean Temperature', ylab = "Expected Number of Trees")
## now just edit the appropriate columns to make the rest of the plots
for(i in 2:4) {
nm = colnames(x_plot)[i]
nm_pr = colnames(x_plot)[i-1]
x_plot[,nm_pr] = mean(trees[,nm_pr]) # set the previous column to the mean
# set the next to a sequence
x_plot[, nm] = seq(min(trees[,nm]), max(trees[,nm]), length.out = 500)
y_plot = predict(mod_h2_1, type='response', newdata = x_plot)
plot(x_plot[,nm], y_plot, type='l', xlab=nm, ylab = "Expected Number of Trees")
}
The quadratic terms produce some rather unexpected predictions. In particular, not only do we find no evidence of an optimum, but there is an anti-optimum with respect to precipitation: the quadratic coefficient on pp_warmest_quarter is positive, so the fitted parabola opens upward and the model predicts the fewest trees at intermediate precipitation. Likely we have not done an adequate job building this model, and so we will need to explore further hypotheses…
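The location of that anti-optimum follows directly from the fitted parabola: holding the other predictors fixed, the linear predictor in pp_warmest_quarter is b1*x + b2*x^2, which (for b2 > 0) is minimised at x = -b1/(2*b2). Using the coefficients reported above (and assuming this value lies within the observed range):

```r
# Vertex of the fitted parabola for pp_warmest_quarter (coefficients from mod_h2_1)
b1 = -2.999770e-02  # linear term
b2 =  4.416641e-05  # quadratic term (positive => upward parabola => a minimum)
-b1 / (2 * b2)      # precipitation value with the fewest predicted trees, ~340
```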
But for the moment, we have another problem. Poisson models assume the variance is equal to the mean. We can check this with the “dispersion parameter,” which should be equal to one. Compute it as the ratio of residual deviance to the degrees of freedom. For comparison, we can do this for all models.
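To see what this check detects, here is a self-contained sketch (simulated counts, not the tree data): truly Poisson counts give a deviance/df ratio near one, while overdispersed (negative binomial) counts inflate it well above one.

```r
# Sketch: deviance / residual df as an overdispersion check (simulated data)
set.seed(3)
x  = rnorm(1000)
mu = exp(1 + 0.5 * x)

y_pois = rpois(1000, mu)                   # variance equals the mean
y_over = rnbinom(1000, mu = mu, size = 1)  # variance much larger than the mean

m_pois = glm(y_pois ~ x, family = poisson)
m_over = glm(y_over ~ x, family = poisson)

m_pois$deviance / m_pois$df.residual  # close to 1
m_over$deviance / m_over$df.residual  # well above 1: overdispersion
```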
mod_h2_1$deviance / mod_h2_1$df.residual
## [1] 4.675947
mod_h2_2$deviance / mod_h2_2$df.residual
## [1] 4.684589
mod_h2_3$deviance / mod_h2_3$df.residual
## [1] 4.747849
mod_h2_4$deviance / mod_h2_4$df.residual
## [1] 5.015966
As the models get simpler, the dispersion parameter gets worse (simpler models describe less deviance). In all cases, the dispersion parameters are quite a bit greater than one, so in a complete analysis we would need to consider more variables, or another model structure (for example, a quasi-Poisson or negative binomial model). Unfortunately, the counts are much too close to zero to use a normal approximation.
hist(trees$n)
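One standard remedy, not pursued in the analysis above, is a quasi-Poisson family: it keeps the same point estimates but estimates the dispersion and scales the standard errors by its square root. A self-contained sketch on simulated overdispersed counts:

```r
# Sketch: quasi-Poisson widens standard errors under overdispersion (simulated data)
set.seed(9)
x = rnorm(1000)
y = rnbinom(1000, mu = exp(1 + 0.5 * x), size = 1)  # overdispersed counts

m_p  = glm(y ~ x, family = poisson)
m_qp = glm(y ~ x, family = quasipoisson)  # identical coefficients, scaled SEs

se_p  = summary(m_p)$coefficients["x", "Std. Error"]
se_qp = summary(m_qp)$coefficients["x", "Std. Error"]
c(poisson = se_p, quasipoisson = se_qp)  # the quasi-Poisson SE is clearly larger
```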